Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that distinguishing between the two cars would be more difficult.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("vehicle.csv")
df.shape
df.dropna(inplace=True)
df.shape
df.describe().T
Since the class variable is categorical, you can use the value_counts function to see how many samples fall into each class.
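A minimal sketch of value_counts on a toy frame (the labels below are illustrative stand-ins, not the real vehicle.csv data):

```python
import pandas as pd

# Toy stand-in for df['class']; the values are for illustration only
toy = pd.DataFrame({'class': ['bus', 'van', 'car', 'car', 'bus', 'car']})

counts = toy['class'].value_counts()  # per-category counts, descending
print(counts)
```

The most frequent category comes first, which makes class imbalance easy to spot at a glance.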
sns.pairplot(df, hue='class', diag_kind='kde')
df.isna().sum()
from scipy.stats import zscore
Since the attributes are on different, unknown scales, it is wise to standardize the data using z-scores before applying any clustering method. You can use the zscore function to do this.
num_df = df.drop('class', axis=1)
class_df = df['class']
num_df = num_df.apply(zscore)
vdf = num_df.join(class_df)
vdf.head()
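For reference, scipy.stats.zscore standardizes each column to mean 0 and unit variance using the population standard deviation (ddof=0); a quick check on synthetic data (not the vehicle set) confirms the equivalence:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(5, 2, 100), 'b': rng.normal(-3, 7, 100)})

scaled = toy.apply(zscore)

# Equivalent by hand: (x - mean) / std, with the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.mean().round(6))       # each column has mean ~0
print(scaled.std(ddof=0).round(6))  # and population std ~1
```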
distortion = []
Iterate k from 1 to 10, fit a K-means model for each k, and use cdist to compute the average within-cluster sum-of-squares (distortion).
vdf_attributes = vdf.drop('class', axis=1)
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
clusters = range(1, 11)
# Iterate k from 1 to 10 and fit a K-means model for each value
for k in clusters:
    model = KMeans(n_clusters=k).fit(vdf_attributes)
    # Distance of each point to its nearest centroid, averaged over all points
    dist = sum(np.min(cdist(vdf_attributes, model.cluster_centers_, 'euclidean'), axis=1))
    distortion.append(dist / vdf_attributes.shape[0])
distortion
plt.plot(clusters, distortion, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Use Matplotlib to plot the scree (elbow) plot. Note: the scree plot shows distortion against the number of clusters.
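As an aside, scikit-learn's KMeans also exposes the within-cluster sum of squares directly via the inertia_ attribute, so the elbow curve can be built without cdist. A sketch on synthetic blobs (make_blobs data is a stand-in, not the vehicle set):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest centroid

# Inertia drops sharply up to the true number of clusters, then flattens
print([round(i) for i in inertias])
```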
# Lets use k = 3
final_model = KMeans(n_clusters=3)
final_model.fit(vdf_attributes)
prediction=final_model.predict(vdf_attributes)
#Append the prediction
vdf["Group"] = prediction
print("Group Assigned : \n")
vdf[["class", "Group"]]
Note: since the data has more than two dimensions we cannot visualize it directly. As an alternative, we can inspect the centroids and note how they are distributed across the different dimensions.
u=final_model.cluster_centers_
u
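Another workaround for high-dimensional data (not in the original notebook) is to project onto the first two principal components and colour the points by cluster. A sketch with synthetic 18-dimensional data standing in for vdf_attributes:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 18-dimensional stand-in for the standardized vehicle attributes
X, _ = make_blobs(n_samples=400, n_features=18, centers=3, random_state=7)

labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)  # project onto the 2 directions of largest variance

print(X2.shape)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each PC
# plt.scatter(X2[:, 0], X2[:, 1], c=labels) would now show the clusters in 2-D
```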
You can use the cluster_centers_ attribute of the fitted KMeans instance to pull the centroid information.
sns.scatterplot(x=vdf_attributes.iloc[:, 1], y=vdf_attributes.iloc[:, 2], hue=vdf['Group'], palette='spring')
plt.scatter(final_model.cluster_centers_[:, 1], final_model.cluster_centers_[:, 2], s=100, marker='s', c='red', label='Centroids')
vdf.boxplot(by = 'Group', layout=(4,5), figsize=(15, 10))
vdf_attributes.columns
Hint: use the pd.DataFrame function
centroids_df = pd.DataFrame(final_model.cluster_centers_,columns=vdf_attributes.columns)
centroids_df
final_model.labels_
vdf_attributes['labels'] = final_model.labels_
vdf_attributes.groupby(['labels']).count()
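Since the true class labels are available, a cross-tabulation of cluster against class (a hypothetical extra check, not in the original notebook) shows how well the k-means grouping recovers bus/van/car. A sketch with toy labels:

```python
import pandas as pd

# Toy true classes and cluster assignments standing in for
# vdf['class'] and vdf['Group']
true_class = pd.Series(['bus', 'bus', 'van', 'van', 'car', 'car', 'car'])
cluster = pd.Series([0, 0, 1, 1, 2, 2, 1])

ct = pd.crosstab(true_class, cluster, rownames=['class'], colnames=['Group'])
print(ct)
# Each row shows how one class is split across clusters; a near-diagonal
# table means the clusters align well with the true classes.
```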
For hierarchical clustering, we will create a dataset by sampling from multivariate normal distributions so we can visually observe how the clusters are formed at the end.
np.random.seed(101) # for repeatability of this dataset
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
c = np.random.multivariate_normal([10, 20], [[3, 1], [1, 4]], size=[100,])
print(a.shape)
print(b.shape)
print(c.shape)
hcl_df = np.concatenate((a,b,c), axis=0)
hcl_df.shape
hcldf = pd.DataFrame(hcl_df)
hcldf.head()
sns.pairplot(hcldf, diag_kind='kde')
from sklearn.cluster import AgglomerativeClustering
Use ward as the linkage method and Euclidean as the distance metric.
hcmodel = AgglomerativeClustering(n_clusters=6, metric='euclidean', linkage='ward')
hcmodel.fit(hcldf)
hcmodel.labels_
hcldf['labels'] = hcmodel.labels_
hcldf.groupby(["labels"]).count()
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist
Z = linkage(hcldf.drop('labels', axis=1), 'ward')
c, coph_dists = cophenet(Z, pdist(hcldf.drop('labels', axis=1)))
c
distDf = pd.DataFrame(coph_dists, columns=['clusterDist'])
distDf['euclideanDist'] = pdist(hcldf.drop('labels', axis=1))
distDf.head(5)
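For context, the cophenetic correlation coefficient c measures how faithfully the dendrogram's merge heights preserve the original pairwise distances; values near 1 mean the hierarchy represents the data well. A small self-contained sketch on synthetic blobs:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two clearly separated 2-D blobs
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])

Z = linkage(X, 'ward')
c, coph_dists = cophenet(Z, pdist(X))  # correlation + pairwise coph. distances

print(round(c, 3))  # close to 1 when the hierarchy fits the data well
```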
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z)
plt.tight_layout()
plt.figure(figsize=(10, 10))
dendrogram(Z, truncate_mode='lastp', p=12)
plt.tight_layout()
Hint: use the truncate_mode='lastp' argument of the dendrogram function to show only the last p merges.
# The dendrogram suggests a cut-off distance of about 50 for clustering the data
# Set the cut-off to max_d = 50
from scipy.cluster.hierarchy import fcluster
z=fcluster(Z, t=50, criterion='distance')
z
plt.scatter(hcldf[0], hcldf[1], c=z)